iRefIndex is a collection of protein interaction databases that provides an index of canonical interaction pairs, together with references to the databases providing evidence for each interaction. The purpose of this notebook is to extract a binary feature for each database integrated into iRefIndex. These databases are:
To extract these features we iterate over the rows of the iRefIndex table and use each pair of Entrez Gene IDs as a dictionary key, storing under that key the source database named by every row that reports the pair:
In [1]:
cd ../../iRefIndex/
In [4]:
import csv
In [13]:
import pdb
In [24]:
f = open("9606.mitab.08122013.txt")
c = csv.reader(f,delimiter="\t")
irefindexdict = {}
for l in c:
    #extract Gene IDs from the alternative identifier columns of interactors A and B
    gids = []
    for x in [l[2],l[3]]:
        for s in x.split("|"):
            s = s.split(":")
            if s[0]=="entrezgene/locuslink":
                gids.append(s[1])
    #only add entry to dictionary if there is a pair of Gene IDs
    if len(gids) == 2:
        #l[12] names the source database reporting this interaction
        try:
            irefindexdict[frozenset(gids)] += [l[12]]
        except KeyError:
            irefindexdict[frozenset(gids)] = [l[12]]
f.close()
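The keying scheme used in the loop above can be checked on a few hand-made rows. The sketch below is only an illustration: the helper name `entrez_pair_key` and all identifiers in it are made up, and it assumes the same `entrezgene/locuslink:<id>` entries separated by `|` that the loop parses. It shows why a frozenset is a convenient key: the order of the two interactors does not matter, and a self-interaction collapses to a single-element set.

```python
def entrez_pair_key(alt_a, alt_b):
    """Return a frozenset of the Entrez Gene IDs found in two alt-ID fields, or None."""
    gids = []
    for field in (alt_a, alt_b):
        for entry in field.split("|"):
            parts = entry.split(":")
            if parts[0] == "entrezgene/locuslink":
                gids.append(parts[1])
    # mirror the notebook: only keep rows that yield exactly two Gene IDs
    return frozenset(gids) if len(gids) == 2 else None

# made-up identifiers, for illustration only
key_ab = entrez_pair_key("uniprotkb:P00001|entrezgene/locuslink:111",
                         "entrezgene/locuslink:222")
key_ba = entrez_pair_key("entrezgene/locuslink:222",
                         "uniprotkb:P00001|entrezgene/locuslink:111")
key_self = entrez_pair_key("entrezgene/locuslink:111", "entrezgene/locuslink:111")
key_missing = entrez_pair_key("uniprotkb:P00001", "entrezgene/locuslink:222")

print(key_ab == key_ba)    # same key regardless of interactor order
print(len(key_self))       # a self-interaction gives a one-element set
print(key_missing)         # rows without two Gene IDs are skipped
```

The one-element key for self-interactions is the reason the writing step further down has to duplicate the single ID when it emits the two protein columns.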
Now we find the strings corresponding to unique databases:
In [26]:
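from itertools import chain
#flatten the per-pair lists of database names and keep each name once, in a fixed order
uniqdbs = sorted(set(chain.from_iterable(irefindexdict.values())))
print(uniqdbs)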
Using these we can create a dictionary with the same keys as above, holding a 1-of-k style binary vector with one entry per database (a pair reported by several databases has several entries set to 1):
In [27]:
ireffeaturedict = {}
for k in irefindexdict.keys():
    fvector = []
    for db in uniqdbs:
        if db in irefindexdict[k]:
            fvector.append("1")
        else:
            fvector.append("0")
    ireffeaturedict[k] = fvector
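To make the encoding concrete, here is a minimal sketch on toy data. The pair IDs and the database names `dbA`, `dbB` and `dbC` are invented for illustration and do not come from iRefIndex; the variable names are also hypothetical. The logic is the same as in the cell above, written as a list comprehension.

```python
# toy stand-in for irefindexdict: pair of Gene IDs -> list of source databases
toy_sources = {
    frozenset(["111", "222"]): ["dbA", "dbB", "dbA"],  # reported twice by dbA
    frozenset(["333"]): ["dbC"],                       # a self-interaction
}

# one column per distinct database, in sorted order
toy_dbs = sorted(set(db for dbs in toy_sources.values() for db in dbs))

# one binary string per column: "1" if that database reports the pair
toy_features = {}
for pair, dbs in toy_sources.items():
    seen = set(dbs)
    toy_features[pair] = ["1" if db in seen else "0" for db in toy_dbs]

print(toy_dbs)
print(toy_features[frozenset(["111", "222"])])
```

Repeated reports from the same database do not change the vector, since each entry only records presence or absence.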
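These results will be saved in two ways: as a tab-separated text file with one row per protein pair, and as a pickled feature object built with ocbio.irefindex.features: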
In [29]:
f = open("human.iRefIndex.Entrez.1ofk.txt", "w")
c = csv.writer(f,delimiter="\t")
c.writerow(["protein1","protein2"]+uniqdbs)
for k in ireffeaturedict.keys():
    pair = list(k)
    if len(pair) == 1:
        pair = pair*2
    c.writerow(pair + ireffeaturedict[k])
f.close()
In [30]:
!head human.iRefIndex.Entrez.1ofk.txt
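As a quick check of the text format, the file can be read back into a dictionary keyed the same way. The following is a sketch, not part of the original notebook: `read_feature_table` is a hypothetical helper, and the sample content is made up so the example runs without the real file. It assumes the layout written above, namely a header row followed by two ID columns and one column per database.

```python
import csv
import io

def read_feature_table(handle):
    """Read a tab-separated feature table into (database names, {pair: vector})."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    dbs = header[2:]  # the first two columns are protein1 and protein2
    table = {}
    for row in reader:
        # a duplicated ID (self-interaction) collapses back to a one-element key
        table[frozenset(row[:2])] = row[2:]
    return dbs, table

# made-up sample in the same layout as the file written above
sample = io.StringIO(u"protein1\tprotein2\tdbA\tdbB\n111\t222\t1\t0\n333\t333\t0\t1\n")
dbs, table = read_feature_table(sample)
print(dbs)
print(table[frozenset(["111", "222"])])
```

To apply this to the real output, pass an open handle on human.iRefIndex.Entrez.1ofk.txt instead of the in-memory sample.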
In [31]:
import sys
In [32]:
sys.path.append("/home/gavin/Documents/MRes/opencast-bio/")
In [35]:
import ocbio.irefindex
In [37]:
features = ocbio.irefindex.features(ireffeaturedict)
In [38]:
import pickle
In [39]:
f = open("human.iRefIndex.Entrez.1ofk.pickle","wb")
pickle.dump(features,f)
f.close()